Mathematical intuition of gradient descent
Understanding and coding gradient descent from scratch
- 1. What gradient descent looks like
- 2. How gradient descent works: simple maths
- 3. How is the gradient calculated?
- 4. Newton's method from scratch
import matplotlib.pyplot as plt
import numpy as np
from sympy import *
1. What gradient descent looks like
Gradient descent is commonly used in ML and AI. Looking at the image below, we often hear in lectures that gradient descent is the 'process of finding the lowest point of a function'. While this is a concise and accurate explanation, looking at it mathematically enables a deeper understanding.
![Gradient descent illustration]()
2. How gradient descent works: simple maths
The gradient descent image above follows the line until it reaches the 'bottom' of the function. This 'bottom' of the function is called the 'minimum'. We want to reach the minimum because it gives us the best values for our problem. One of the common uses of gradient descent is estimating the parameters of an ML or AI model: we want to find the parameters that have the smallest loss value (closest to the actual observations).
Let's consider the function $y = (x-5)^2$. We want to find the minimum value of this function by following a sequence of steps (a minimal code sketch follows the list):
- Start from a certain point
- Move to a new point that gives a better objective value (here, a smaller value)
- Repeat step 2
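As a minimal sketch of this three-step loop (no gradients yet; the probe distance of 0.1 and the starting point $x=7$ are arbitrary choices for illustration):
def y(x):
    return (x - 5) ** 2

x, step = 7.0, 0.1                 # 1. start from a certain point
while True:
    best = min([x - step, x + step], key=y)
    if y(best) >= y(x):            # no neighbouring point is better -> stop
        break
    x = best                       # 2. move to the better point, 3. repeat
print(x)                           # ends near x = 5, the minimum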
# x = np.arange(2, 8, 0.01)
# def y(x):
#     return (x-5)**2
# fig, ax = plt.subplots()
# ax.plot(x, y(x))
# ax.set_xlabel("x")
# ax.set_ylabel("y")
# ax.plot([5], y(5), marker='o', color='black')
# fig.savefig('my_icons/03_/one.png');
![](my_icons/03_/one.png)
One thing most explanations leave out is the constraint. Every optimization problem has an objective function and a constraint. Here the objective function is $y = (x-5)^2$, and the constraint is hidden: $x \in \mathbb{R}$, $y \in \mathbb{R}$ (where $\mathbb{R}$ denotes the real numbers).
The optimization problem changes with its constraints, but for now let's say the constraint is $y \geq (x-5)^2$. This means the set of potential solutions for this problem is all the points on or above the function line.
The process of gradient descent will only happen inside the blue area (the points that satisfy the constraint).
# x = np.arange(2, 8, 0.01)
# def y(x):
#     return (x-5)**2
# fig, ax = plt.subplots()
# ax.plot(x, y(x))
# ax.set_xlabel("x")
# ax.set_ylabel("y")
# ax.fill_between(x, y(x), 9, color='blue', alpha=.1)
# ax.plot([5], y(5), marker='o', color='black')
# fig.savefig('my_icons/03_/two.png')
![](my_icons/03_/two.png)
Now let's say we start at the point $x=7$, $y=4$.
# x = np.arange(2, 8, 0.01)
# def y(x):
#     return (x-5)**2
# fig, ax = plt.subplots()
# ax.plot(x, y(x))
# ax.set_xlabel("x")
# ax.set_ylabel("y")
# ax.fill_between(x, y(x), 9, color='blue', alpha=.1)
# ax.plot([7], y(7), marker='o', color='red')
# ax.plot([5], y(5), marker='o', color='black')
# fig.savefig('my_icons/03_/three.png')
![](my_icons/03_/three.png)
This red dot can go anywhere as long as it stays in the blue area. Let's say the dot moves to $(6,6)$. This point is still in the blue area and satisfies the constraint we have set; this property is called feasibility.
- The process of gradient descent ONLY considers solutions that are feasible (that satisfy the constraints); a quick check is sketched below
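This check can be written down directly. A tiny sketch, where `is_feasible` is just an illustrative helper name:
def is_feasible(x, y_value):
    # constraint from above: the point must lie on or above the curve
    return y_value >= (x - 5) ** 2

print(is_feasible(6, 6))    # True: (6, 6) lies in the blue area
print(is_feasible(6, 0.5))  # False: below the curve, so infeasible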
# x = np.arange(2, 8, 0.01)
# def y(x):
#     return (x-5)**2
# fig, ax = plt.subplots()
# ax.plot(x, y(x))
# ax.set_xlabel("x")
# ax.set_ylabel("y")
# ax.fill_between(x, y(x), 9, color='blue', alpha=.1)
# ax.plot([7], y(7), marker='o', color='red')
# ax.plot([6], 6, marker='o', color='green')
# ax.plot([6], 2, marker='o', color='blue')
# ax.plot([5], y(5), marker='o', color='black')
# ax.arrow(7, y(7), 6-7+0.1, 2-y(7)+0.1, head_width=0.1, head_length=0.1);
# fig.savefig('my_icons/03_/four.png')
![](my_icons/03_/four.png)
While the point $(6,6)$ is feasible, does it move us toward the optimal value? The optimal value we want to find is the smallest feasible $y$, the bottom of $y=(x-5)^2$.
- $y$ value at the red dot $(7, 4)$: 4
- $y$ value at the green dot $(6, 6)$: 6
The objective value actually increased, which means this is not the right direction.
If the red dot instead moves in the direction of the blue dot, the objective value decreases. That is the direction we want to go, and the 'amount' we move is controlled by the learning rate ($\lambda$), a small step in that direction.
- The solution at each step is updated by $x^1 = x^0 + \lambda d^0$, where the direction $d^0$ is chosen so that the solution gets closer to the (global) minimum.
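As an illustrative single step on $y=(x-5)^2$: starting from $x^0 = 7$, with derivative $y' = 2(x-5)$ and a hypothetical learning rate $\lambda = 0.25$,

$$d^0 = -y'(x^0) = -2(7-5) = -4, \qquad x^1 = x^0 + \lambda d^0 = 7 + 0.25 \cdot (-4) = 6$$

so the solution moves from $x=7$ toward the minimum at $x=5$.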
3. How is the gradient calculated?
There are two aspects to calculating a gradient step. The first is the 'direction', and the second is the 'distance'.
Let's say $d$ stands for the direction and $\alpha$ for the step size (the 'distance'). Then the goal can be formulated as $f(x^k + \alpha d^k) < f(x^k)$: the new function value is smaller than the current one.
Using a first-order Taylor expansion, the left-hand side can be approximated as $f(x^k + \alpha d^k) \approx f(x^k) + \alpha\nabla f(x^k)^T d^k$
Since we want $f(x^k + \alpha d^k) < f(x^k)$, we need $\alpha\nabla f(x^k)^T d^k$ to be smaller than zero. The steepest descent direction is $d^k = -\nabla f(x^k)$, the exact opposite of the gradient $\nabla f(x^k)$
Then, the point is updated at every step by $x^{k+1} \leftarrow x^k - \alpha \nabla f(x^k)$
This process continues until $\|\nabla f(x^k)\| \leq \epsilon$, where $\epsilon$ is a very small number. This is the point where the gradient is small enough or vanishes, meaning there is no more 'movement'.
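Putting the update rule and the stopping criterion together, here is a minimal gradient descent on $y=(x-5)^2$; the step size $\alpha = 0.1$ and tolerance $\epsilon = 10^{-6}$ are arbitrary illustrative choices:
def f(x):
    return (x - 5) ** 2

def grad_f(x):
    return 2 * (x - 5)            # analytic derivative of (x - 5)**2

x = 7.0                           # starting point
alpha, eps = 0.1, 1e-6            # step size and stopping tolerance
k = 0
while abs(grad_f(x)) > eps:       # stop once the gradient (almost) vanishes
    x = x - alpha * grad_f(x)     # x^{k+1} <- x^k - alpha * grad f(x^k)
    k += 1
print(x, k)                       # x converges to ~5, the minimum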
# x = np.arange(2, 8, 0.01)
# def y(x):
#     return (x-5)**2
# fig, ax = plt.subplots()
# ax.plot(x, y(x))
# ax.set_xlabel("x")
# ax.set_ylabel("y")
# ax.fill_between(x, y(x), 9, color='blue', alpha=.1)
# ax.plot([7], y(7), marker='o', color='red')
# ax.plot([5], y(5), marker='o', color='black')
# def line1(x):
#     return 4*x - 24
# def line2(x):
#     return x*0
# ax.plot(np.arange(6, 8, 0.01), line1(np.arange(6, 8, 0.01)), color='red')
# ax.plot(x, line2(x), color='black');
# fig.savefig('my_icons/03_/five.png')
![](my_icons/03_/five.png)
The gradient at the red dot is positive (the red tangent line slopes upward), so the update $x \leftarrow x - \alpha \nabla f(x)$ moves the point to the left. Once the point reaches the black dot, the gradient is zero (it vanishes): there is no further direction for the point to move, indicating the end of the search.
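This can be verified with sympy (the variable names here are local to this check):
from sympy import Symbol, diff, expand

x_sym = Symbol('x')
f_sym = (x_sym - 5) ** 2
slope = diff(f_sym, x_sym).subs(x_sym, 7)                # gradient at the red dot, x = 7
tangent = expand(slope * (x_sym - 7) + f_sym.subs(x_sym, 7))
print(slope)    # 4: positive, so the update moves the point to the left
print(tangent)  # 4*x - 24: the red tangent line drawn in the plot code above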
4. Newton's method from scratch
A classic method closely related to gradient descent is Newton's method, which uses second-order (curvature) information to choose its steps.
Let $x^k$ be the current state and consider a second order Taylor approximation of $f(x)$.
$g(x) = f(x^k)+\nabla f(x^k)^T (x-x^k) + \frac{1}{2}(x-x^k)^T\nabla^2f(x^k)(x-x^k)$
The next step is to find the point that minimizes the function $g(x)$. That is the point where the first derivative equals zero: $\nabla g(x) = \nabla f(x^k)+\nabla^2f(x^k)(x-x^k) = 0$
Solving this equation for $x$ gives us
$x = x^k - [\nabla^2f(x^k)]^{-1} \nabla f(x^k)$, which becomes the new $x^{k+1}$
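For intuition, apply this to the earlier quadratic $y = (x-5)^2$, where $\nabla f = 2(x-5)$ and $\nabla^2 f = 2$. Starting from $x^0 = 7$, Newton's method reaches the minimum in a single step, because the second-order Taylor approximation of a quadratic is exact:

$$x^1 = x^0 - \frac{2(x^0-5)}{2} = 7 - \frac{4}{2} = 5$$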
To turn this into code, here is the pseudocode.
![Pseudocode: Newton's method with line search]()
Parameters
- $k$ is the index of the current step
- $\epsilon$ is the threshold for stopping the search
- $\alpha$ is the step size. $\alpha$ changes via line search: we want a step size that actually decreases the function value
- $\rho$ and $c$ are scalars between 0 and 1, used to derive $\alpha$ during the line search
Code walkthrough
- Set the initial values for $\alpha$, $d$, $k$. The initial starting point $x^0$ is provided via input
- The search ends only when the first derivative of $f(x)$ is smaller than $\epsilon$; more precisely, the norm of the gradient vector
- Inside the search loop, first perform a line search to find $\alpha$ (a standalone sketch of this follows the list)
- Once $\alpha$ is found, update $x^{k+1}$ as well as $d^{k+1}$
- Continue until the gradient vanishes or is small enough
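As referenced in the walkthrough, here is a minimal sketch of a backtracking (Armijo) line search; the function name and the one-dimensional usage are illustrative only:
import numpy as np

def backtracking_line_search(f, grad, x, d, alpha=1.0, rho=0.2, c=0.4):
    # shrink alpha by a factor rho until the sufficient-decrease condition
    # f(x + alpha*d) <= f(x) + c * alpha * grad(x)^T d holds
    while f(x + alpha * d) > f(x) + c * alpha * np.dot(grad(x), d):
        alpha *= rho
    return alpha

# illustrative usage on the 1-D example f(x) = (x - 5)**2 from earlier
f_toy = lambda x: (x - 5) ** 2
grad_toy = lambda x: 2 * (x - 5)
print(backtracking_line_search(f_toy, grad_toy, np.array([7.0]), np.array([-4.0])))  # 0.2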
def plot_contour_2d(X1, X2, y):
    fig, ax = plt.subplots(figsize=(12,7))
    c_plot = ax.contour(X1, X2, y)
    ax.clabel(c_plot, inline=True, fontsize=10)
    ax.set_title('2D contour plot')

def plot_contour_3d(X1, X2, y, save_dir):
    fig = plt.figure(figsize=(12,7))
    ax = plt.axes(projection='3d')
    ax.contour3D(X1, X2, y, 50, cmap='binary')
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_zlabel("y")
    ax.set_title('3D contour plot')
    fig.savefig(save_dir)  # save_dir was previously unused; save the figure as intended
# X1 = np.arange(-50, 50, 0.1)
# X2 = np.arange(-50, 20, 0.1)
# X1, X2 = np.meshgrid(X1, X2)
# y = 100 * (X2 - X1**2)**2 + (1-X1)**2
# plot_contour_3d(X1, X2, y, 'my_icons/03_/3D_contour.png')
![](my_icons/03_/3D_contour.png)
# define the symbolic variables x1, x2 and the function y
# (this is the Rosenbrock function, whose global minimum is at (1, 1))
x1 = Symbol('x1')
x2 = Symbol('x2')
y = 100 * (x2 - x1**2)**2 + (1-x1)**2
print("Function Y")
y
# in order for the symbolic function to be evaluated with numeric x input, it has to be 'lambdified'
# for more information, see the sympy package
# it allows placeholders for the function, making it easy for us to read and understand
def f(x1_value, x2_value, x1=x1, x2=x2, function=y):
    """
    Input the values for x1 and x2 to the function
    """
    working_function = lambdify((x1,x2), function, 'numpy')
    return working_function(x1_value, x2_value)
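A quick sanity check of the wrapper: the Rosenbrock function has its global minimum at $(1, 1)$, where it evaluates to 0.
print(f(1.0, 1.0))  # 0.0: the global minimum of the Rosenbrock function
print(f(0.0, 0.0))  # 1.0 = 100*(0 - 0)**2 + (1 - 0)**2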
def hessian_matrix(x1_value, x2_value, x1, x2, y):
    """
    Calculates the hessian matrix for a 2 variable function.
    """
    matrix = np.zeros((2,2))
    for i, der1 in enumerate([x1,x2]):
        for j, der2 in enumerate([x1,x2]):
            matrix[i,j] = f(x1_value, x2_value, x1, x2, y.diff(der1).diff(der2))
    return matrix
def first_order_matrix(x1_value, x2_value, x1, x2, y):
    """
    Calculates the gradient (vector of first partial derivatives) for a 2 variable function.
    """
    matrix = np.zeros((2,1))
    for i, der in enumerate([x1,x2]):
        matrix[i] = f(x1_value, x2_value, x1, x2, y.diff(der))
    return matrix
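As a sanity check of these helpers, at the minimum $(1, 1)$ the gradient should vanish and the Hessian should be positive definite; the expected values in the comments can be verified by hand:
print(first_order_matrix(1.0, 1.0, x1, x2, y))  # [[0.], [0.]]: the gradient vanishes
print(hessian_matrix(1.0, 1.0, x1, x2, y))      # [[802., -400.], [-400., 200.]]: positive definite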
print("Partial derivative x1")
y.diff(x1)
print("Partial derivative x2")
y.diff(x2)
print("Partial derivative x1, x2")
y.diff(x1).diff(x2)
print("Partial derivative x2, x1")
y.diff(x2).diff(x1)
Newton's method implemented
def newton_iteration(x, x1, x2, y, verbose=False):
    """
    x: initial value (2x1 numpy array)
    """
    # initialization
    k = 0
    e = 10**(-4)
    # calculate hessian matrix and first derivative matrix
    hessian_m = hessian_matrix(x[0], x[1], x1, x2, y)
    first_order_m = first_order_matrix(x[0], x[1], x1, x2, y)
    # use them to calculate the Newton direction
    d = -1 * np.dot(np.linalg.inv(hessian_m), first_order_m)
    # calculate the l2 norm of the gradient, used as the stopping criterion
    gradient_norm = np.linalg.norm(first_order_m)
    # initial value for alpha and parameters for line search
    alpha = 1
    rho = 0.2
    c = 0.4
    if verbose:
        print("Initialized algorithm.")
        print("step: {}\nd: {}\ngradient: {}\nthreshold: {}".format(k, d, gradient_norm, e))
        print("-"*40)
    # run until gradient is small enough
    while gradient_norm > e:
        # perform line search -> shrink alpha until the sufficient-decrease condition holds
        # (note: alpha carries over between iterations; it is never reset to 1)
        while f(x[0]+alpha*d[0], x[1]+alpha*d[1]) > f(x[0], x[1]) + c*alpha*np.dot(first_order_matrix(x[0], x[1], x1, x2, y).T, d):
            alpha *= rho
        # update x
        x = x + alpha*d
        # calculate updated hessian and first order matrix
        hessian_m = hessian_matrix(x[0], x[1], x1, x2, y)
        first_order_m = first_order_matrix(x[0], x[1], x1, x2, y)
        # update d
        d = -1 * np.dot(np.linalg.inv(hessian_m), first_order_m)
        # update norm value of gradient
        gradient_norm = np.linalg.norm(first_order_m)
        k += 1
        function_value = f(x[0], x[1])
        if verbose:
            print("step: {}\nalpha: {}\nx: {}\nupdated function value: {}".format(k, alpha, x, function_value[0]))
            print("d: {}\ngradient norm: {}\nthreshold: {}".format(d, gradient_norm, e))
            print()
    return x, k
x, k = newton_iteration(np.array([[1.1],[1.1]]), x1, x2, y, False)
print("Function converged at step {}. \nx = {}".format(k, x))
If the starting point is $[x_1,x_2]^T=[1.1, 1.1]^T$, the function converges to $[x_1,x_2]^T\approx[1, 1]^T$ at step 44.
x, k = newton_iteration(np.array([[6],[4]]), x1, x2, y, False)
print("Function converged at step {}. \nx = {}".format(k, x))
If the starting point is $[6, 4]^T$, it takes many more iterations (491 steps), but the result is similar: $[x_1,x_2]^T\approx[1, 1]^T$
Since gradient descent is an iterative process, and optimizing a function gets harder as the function gets more complicated, the exact number of steps differs depending on where the iteration starts. The starting point also determines whether the search ends up in a local or the global minimum.
The gradient descent process is printed out below for reference. You can see the gradient norm decreasing as the function value decreases, while $x$ converges towards $[1, 1]^T$.
x, k = newton_iteration(np.array([[1.1],[1.1]]), x1, x2, y, True)